Goto

Collaborating Authors

 Aircraft


UAV3D: A Large-scale 3D Perception Benchmark for Unmanned Aerial Vehicles

Neural Information Processing Systems

Unmanned Aerial Vehicles (UAVs), equipped with cameras, are employed in numerous applications, including aerial photography, surveillance, and agriculture. In these applications, robust object detection and tracking are essential for the effective deployment of UAVs. However, existing benchmarks for UAV applications are mainly designed for traditional 2D perception tasks, restricting the development of real-world applications that require a 3D understanding of the environment. Furthermore, despite recent advancements in single-UAV perception, limited views of a single UAV platform significantly constrain its perception capabilities over long distances or in occluded areas. To address these challenges, we introduce UAV3D - a benchmark designed to advance research in both 3D and collaborative 3D perception tasks with UAVs. UAV3D comprises 1,000 scenes, each of which has 20 frames with fully annotated 3D bounding boxes on vehicles. We provide the benchmark for four 3D perception tasks: single-UAV 3D object detection, single-UAV object tracking, collaborative-UAV 3D object detection, and collaborative-UAV object tracking.


Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

Neural Information Processing Systems

Web-scale visual entity recognition, the task of associating images with their corresponding entities within vast knowledge bases like Wikipedia, presents significant challenges due to the lack of clean, large-scale training data. In this paper, we propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation. Instead of relying on the multimodal LLM to directly annotate data, which we found to be suboptimal, we prompt it to reason about potential candidate entity labels by accessing additional contextually relevant information (such as Wikipedia), resulting in more accurate annotations. We further use the multimodal LLM to enrich the dataset by generating question-answer pairs and a grounded finegrained textual description (referred to as "rationale") that explains the connection between images and their assigned entities. Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks (e.g.


Advancing Fine-Grained Classification by Structure and Subject Preserving Augmentation

Neural Information Processing Systems

Fine-grained visual classification (FGVC) involves classifying closely related sub-classes. This task is difficult due to the subtle differences between classes and the high intra-class variance. Moreover, FGVC datasets are typically small and challenging to gather, thus highlighting a significant need for effective data augmentation. Recent advancements in text-to-image diffusion models offer new possibilities for augmenting classification datasets. While these models have been used to generate training data for classification tasks, their effectiveness in fulldataset training of FGVC models remains under-explored.


Highlights From Starships Test Flight 9: Everything That Happened in 17 Minutes

Mashable

Highlights From Starship's Test Flight 9: Everything That Happened in 17 Minutes Mashable Tech Science Life Social Good Entertainment Deals Shopping Games Search Cancel * * Search Result Tech Apps & Software Artificial Intelligence Cybersecurity Cryptocurrency Mobile Smart Home Social Media Tech Industry Transportation All Tech Science Space Climate Change Environment All Science Life Digital Culture Family & Parenting Health & Wellness Sex, Dating & Relationships Sleep Careers Mental Health All Life Social Good Activism Gender LGBTQ Racial Justice Sustainability Politics All Social Good Entertainment Games Movies Podcasts TV Shows Watch Guides All Entertainment SHOP THE BEST Laptops Budget Laptops Dating Apps Sexting Apps Hookup Apps VPNs Robot Vaccuums Robot Vaccum & Mop Headphones Speakers Kindles Gift Guides Mashable Choice Mashable Selects All Sex, Dating & Relationships All Laptops All Headphones All Robot Vacuums All VPN All Shopping Games Product Reviews Adult Friend Finder Bumble Premium Tinder Platinum Kindle Paperwhite PS5 vs PS5 Slim All Reviews All Shopping Deals Newsletters VIDEOS Mashable Shows All Videos Home Science Space Highlights From Starship's Test Flight 9: Everything That Happened in 17 Minutes Starship Test Flight 9 ends with "confirmation that the booster did demise." By Mashable Video on May 28, 2025 Share on Facebook Share on Twitter Share on Flipboard Watch Next Qualcomm's 2025 Computex Highlights: Everything Announced in 20 Minutes 20:09 Everything Announced at AMD's 2025 Computex Keynote in 19 Minutes 19:31 Everything Revealed at Nvidia's 2025 Computex Press Conference in 19 Minutes 19:55 Microsoft Build 2025 keynote: Everything announced, in 14 minutes 14:43 SpaceX conducted its ninth test flight of the Starship Launch Vehicle atop a Falcon Heavy booster from Starbase, Texas. See all the highlights from the test launch. Topics SpaceX Elon Musk Rocket Launches Latest Videos'Good Fortune' trailer: Keanu Reeves plays a guardian angel in Aziz Ansari's directorial debut Keanu Reeves as an angel? And, well... 05/24/2025 By Leah Stodart Say More: R.L. Stine on'Fear Street: Prom Queen' and Matt Wolf on'Pee-wee as Himself' From teen-defining terror to a childhood icon, '80s nostalgia thrives.


AVOIDDS: Aircraft Vision-based Intruder Detection Dataset and Simulator

Neural Information Processing Systems

Designing robust machine learning systems remains an open problem, and there is a need for benchmark problems that cover both environmental changes and evaluation on a downstream task. In this work, we introduce AVOIDDS, a realistic object detection benchmark for the vision-based aircraft detect-and-avoid problem. We provide a labeled dataset consisting of 72,000 photorealistic images of intruder aircraft with various lighting conditions, weather conditions, relative geometries, and geographic locations. We also provide an interface that evaluates trained models on slices of this dataset to identify changes in performance with respect to changing environmental conditions. Finally, we implement a fullyintegrated, closed-loop simulator of the vision-based detect-and-avoid problem to evaluate trained models with respect to the downstream collision avoidance task. This benchmark will enable further research in the design of robust machine learning systems for use in safety-critical applications.


LLaNA: Large Language and NeRF Assistant

Neural Information Processing Systems

Multimodal Large Language Models (MLLMs) have demonstrated an excellent understanding of images and 3D data. However, both modalities have shortcomings in holistically capturing the appearance and geometry of objects. Meanwhile, Neural Radiance Fields (NeRFs), which encode information within the weights of a simple Multi-Layer Perceptron (MLP), have emerged as an increasingly widespread modality that simultaneously encodes the geometry and photorealistic appearance of objects. This paper investigates the feasibility and effectiveness of ingesting NeRF into MLLM. We create LLaNA, the first general-purpose NeRFlanguage assistant capable of performing new tasks such as NeRF captioning and Q&A. Notably, our method directly processes the weights of the NeRF's MLP to extract information about the represented objects without the need to render images or materialize 3D data structures. Moreover, we build a dataset of NeRFs with text annotations for various NeRF-language tasks with no human intervention. Based on this dataset, we develop a benchmark to evaluate the NeRF understanding capability of our method. Results show that processing NeRF weights performs favourably against extracting 2D or 3D representations from NeRFs.




AircraftVerse: A Large-Scale Multimodal Dataset of Aerial Vehicle Designs

Neural Information Processing Systems

Aircraft design encompasses different physics domains and, hence, multiple modalities of representation. The evaluation of these cyber-physical system (CPS) designs requires the use of scientific analytical and simulation models ranging from computeraided design tools for structural and manufacturing analysis, computational fluid dynamics tools for drag and lift computation, battery models for energy estimation, and simulation models for flight control and dynamics. AircraftVerse contains 27,714 diverse air vehicle designs - the largest corpus of engineering designs with this level of complexity. Each design comprises the following artifacts: a symbolic design tree describing topology, propulsion subsystem, battery subsystem, and other design details; a STandard for the Exchange of Product (STEP) model data; a 3D CAD design using a stereolithography (STL) file format; a 3D point cloud for the shape of the design; and evaluation results from high fidelity state-of-the-art physics models that characterize performance metrics such as maximum flight distance and hover-time. We also present baseline surrogate models that use different modalities of design representation to predict design performance metrics, which we provide as part of our dataset release. Finally, we discuss the potential impact of this dataset on the use of learning in aircraft design and, more generally, in CPS. AircraftVerse is accompanied by a data card, and it is released under Creative Commons Attribution-ShareAlike (CC BY-SA) license.